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Abstract 



Flow cytometry is a technology that rapidly measures antigen-based markers associated to cells in a 
cell population. Although analysis of flow cytometry data has traditionally considered one or two markers 



at a time, there has been increasing interest in multidimensional analysis. However, flow cytometers are 

limited in the number of markers they can jointly observe, which is typically a fraction of the number 
p ^ of markers of interest. For this reason, practitioners often perform multiple assays based on different, 

overlapping combinations of markers. In this paper, we address the challenge of imputing the high 
^ dimensional jointly distributed values of marker attributes based on overlapping marginal observations. 

, ^ , We show that simple nearest neighbor based imputation can lead to spurious subpopulations in the 

imputed data, and introduce an alternative approach based on nearest neighbor imputation restricted to 

^ a cell's subpopulation. This requires us to perform clustering with missing data, which we address with 

ON 

a mixture model approach and novel EM algorithm. Since mixture model fitting may be ill-posed, we 

in 

ly-j also develop techniques to initialize the EM algorithm using domain knowledge. We demonstrate our 

approach on real flow cytometry data. 

o 

^T^ Index Terms 

> 

statistical file matching, flow cytometry, mixture model, probabilistic PCA, EM algorithm, imputation, 
incomplete data, clustering 



I. Introduction 

Flow cytometry is a technique for quantitative cell analysis |[1|. It provides simultaneous measurements 
of multiple characteristics of individual cells. Typically, a large number of cells are analyzed in a short 
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period of time - up to thousands of cells per second. Since its development in the late 1960s, flow 
cytometry has become an essential tool in various biological and medical laboratories. Major applications 
of flow cytometry include hematological immunophenotyping and diagnosis of diseases such as acute 
leukemias, chronic lymphoproliferative disorders, and malignant lymphomas pj. 

Flow cytometry data has traditionally been analyzed by visual inspection of one-dimensional histograms 
or two-dimensional scatter plots. Clinicians will visually inspect a sequence of scatter plots based on 
different pairwise marker combinations, and perform gating, the manual selection of marker thresholds, 
to eliminate certain subpopulations of cells. They identify various pathologies based on the shape of cell 
subpopulations in these scatter plots. There has been recent work, reviewed below, on automatic cell 
gating or classification of pathologies based on multidimensional analysis of flow cytometry data. 

Despite the promise of multidimensional analysis, this direction is limited by the number of markers that 
can be simultaneously measured, which is typically a fraction of the number of markers of interest. It is 
therefore common in practice to perform multiple assays based on different, overlapping combinations of 
markers. We may view these combinations as different marginals of the joint distribution of all observed 
markers. However, even when analysis is based on visual inspection of scatter plots, problems arise 
when the desired marker pair was not jointly measured. This situation arises frequently in the analysis 
of historical data. 

To address these issues and to facilitate higher dimensional analysis, we present a statistical method for 
file matching, which imputes higher dimensional flow cytometry data from multiple lower dimensional 
data files. While |[3| proposed a simple approach based on Nearest Neighbor (NN) imputation, this method 
is prone to induce spurious clusters, as we demonstrate below. Our method can improve the file matching 
of flow cytometry and is less likely to generate false clusters. 

In the following, we explain the principles of flow cytometry and introduce the file matching problem 
in the context of flow cytometry data. We then present an approach to file matching which imputes a 
cell's missing marker values with the values of the nearest neighbor among cells of the same type. To 
implement this approach we develop a method for clustering with missing data. We model flow cytometry 
data with a latent variable Gaussian mixture model, where each Gaussian corresponds to a cell type, and 
develop an expectation-maximization (EM) algorithm to fit the model. Since a large majority of all 
values are unobserved, most covariances cannot be estimated from the data. However, domain experts 
possess considerable knowledge about the characteristics of different cell types, and we incorporate 
this knowledge into the initialization of the EM algorithm. We compare our method with simple nearest 
neighbor imputation on real flow cytometry data, and show that our method offers improved performance. 
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Fig. 1. A flow cytometer system. As a stream of cells passes through a laser beam, the photo-detectors detect forward 
angle light scatter, side angle light scatter, and light emissions from fluorochromes. Then the digitized signals are analyzed in 
a computer. 



II. Background and Motivation 

In this section, we explain the principles of flow cytometry. We also define the statistical file matching 
problem in the context of flow cytometry data, and motivate the need for an improved solution. 

A. Flow Cytometry 

In flow cytometry analysis, a cell suspension is first prepared from peripheral blood, bone marrow, or 
lymph node. The suspension of cells is then mixed with a solution of fluorochrome-labeled antibodies. 
Typically, each antibody is labeled with a different fluorochrome. As the stream of suspended cells 
passes through a focused laser beam, they either scatter or absorb the light. If the labeled antibodies are 
attached to proteins of a cell, the associated fluorescent markers absorb the laser and emit light with 
the corresponding wavelength (color). Then a set of photo-detectors in the line of the light beam and 
perpendicular to the light capture the scattered and emitted light. The signals from the detectors are 
digitized and stored in a computer system. Forward scatter (FS) and side scatter (SS) signals as well as 
various fluorescence signals are collected for each cell (see Fig. [T]). 

In a flow cytometer that is capable of measuring d attributes, called markers, the measurements of 
each cell can be represented with a d-dimensional vector x = {x^^\ x^'^\ ■ ■ ■ jX^'^^) where x*^^) is FS, 
j;^^) is SS, and x*^^^ • • • ,x^'^^ are the fluorescent markers. Thus, the accumulation of N cells forms a 
N X d matrix. 
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The detected signals provide information about the physical and chemical properties of each cell 
analyzed. FS is related to the relative size of the cell and SS is related to its internal granularity or 
complexity. The fluorescence signals reflect the abundance of expressed antigens on the cell surface. 
These various attributes are used for identification and quantification of cell populations. FS and SS are 
always measured, while the marker combination is a part of the experimental design. 

Flow cytometry data is usually analyzed using a sequence of one dimensional histograms and two or 
three dimensional scatter plots by choosing a subset of one, two or three markers. The analysis typically 
involves manually selecting and excluding cell subpopulations, called gating, by thresholding and drawing 
boundaries on the scatter plots. Clinicians routinely diagnose by visualizing the scatter plots. 

Recently, some attempts have been made to analyze directly in high dimensional spaces by mathemat- 
ically modeling flow cytometry data. In Q, Q, a mixture of Gaussian distributions is used to model cell 
populations, while a mixture of f-distributions with a Box-Cox transformation is used in ||6|. A mixture 
of skew t-distributions is studied in |7|. The knowledge of experts is sometimes incorporated as prior 
information |8j|. Instead of using finite mixture models, some recent approaches proposed information 
preserving dimension reduction to analyze high dimensional flow cytometry data ||9|, |10|. However, 



standard techniques for multi-dimensional flow cytometry analysis are not yet established. 

B. Statistical File Matching 

The number of dimensions in flow cytometry is limited by the number of light sources and detectable 
fluorochrome markers, and available reagent combinations. Even though recent innovations have enabled 
measuring near 20 cellular attributes, there are typically dozens or hundreds of markers of interest in 
a given flow cytometry experiment. Furthermore, instruments deployed in clinical laboratories still only 



measure 5-7 markers simultaneously 1 1 1 1 . 



Being unable to simultaneously measure all markers of interest, it is common to divide a sample into 
several "tubes" and stain each tube separately with a different set of markers [12|. In practice, partially 
overlapping marker combinations are used to help identify cell populations (see Fig. |2]). The marker 
combinations are designed based on which markers need to be observed together. However, it is not 
always possible to anticipate all marker combinations of potential interest. 

In the sequel, we present a method that generates flow cytometry data in which all the markers of 
interest are available for the union of cells. Thus, we obtain a single higher dimensional dataset beyond 
the current limits of instrumentation. Then pairs of markers that are not measured together can still be 
visualized through scatter plots, and methods of multidimensional analysis may be applied to the full 
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Fig. 2. Flow cytometry analysis on a large number of antibody reagents within a limited capacity of a flow cytometer. A 
sample from a patient is separated into multiple tubes with which different combinations of fluorochrome labeled antibodies are 
stained. Each output file contains at least two variables, FS and SS, in common as well as some variables that are specific to 
the file. 

common specific 1 specific2 

C Si S2 

file 1 (iVi) 



file 2 (iVa) 

Fig. 3. Data structure of two incomplete data files. Two files have some overlapping variables c, and some variables si and 
32 that are never jointly observed. File matching combines the two files by completing the missing blocks of variables. 

dataset. 

This technique, called file matching, merges two nor more datasets that have some commonly observed 
variables as well as some variables unique to each dataset. An exemplary two file case is drawn in Fig. 
[3] Each unit (cell) x„ is a vector in and belongs to one of the data files (tubes) Xi or X2, where each 
file has A'^i and N2 units, respectively. While variables c are observed in all the units, units in Xi have 
variables S2 missing and units in X2 have variables si missing, where si, S2, and c represent specific and 
common variable sets. If the observed and missing components of a unit x„ are denoted by o„ and m„, 
then o„ = c U si and nin = S2 for € Xi, and o„ = c U S2 and rUn = si for x„ € ^2. 

The file matching problem is a missing data problem where blocks of missing data need to be imputed. 
Among imputation methods, algorithms using conditional mean or regression are most common. As shown 
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in Fig. |4j however, these imputation algorithms tend to shrink the variance of data. Thus, these approaches 
are inappropriate in flow cytometry where the shapes of cell populations are important in analysis, and 
the preservation of variability after file matching is highly desired. More discussions on missing data 
analysis and file matching can be found in [13] and [14]. 

true NN 




200 400 600 800 1000 200 400 600 800 1000 

CD45 CD45 



Fig. 4. Examples of imputation methods: NN, conditional mean, and regression. The NN method relatively well preserves the 
distribution of imputed data, while other imputation methods such as conditional mean and regression significantly reduce the 
variability of data. 

||3) proposed to use Nearest Neighbor (NN) imputation to match flow cytometry data files. In their 
approach, missing variables of one unit, called the recipient, are imputed with observed variables from 
a unit in the other file, called the donor, that is most similar. If Xj is a unit in Xi, the missing variables 
are set as follows 

x^^ = x**^ where x* = argmin ||x^ — x^||2. 

Note that the similarity is based on the distance in the projected space of jointly observed variables. This 
algorithm is advantageous over other imputation algorithms, based on conditional mean or regression, 
as displayed in Fig. |4] It generally preserves the distribution of cells, while the other methods cause the 
variance structure to shrink toward zero. 
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Fig. 5. Comparison of results for two imputations methods to the ground truth cell distribution. Figures show scatter plots 
on pairs of markers that are not jointly observed. The middle row and the bottom row shows the imputation results from the 
NN and Cluster-based NN, respectively. The results from the NN method show spurious clusters conspicuously in the right two 
panels. The false clusters are indicated by dotted circles in CD3 vs. CDS and CD3 vs. CD4 scatter plots. On the other hand, 
the results from our proposed approach better resemble the true distribution on the top row. 



However, the NN method sometimes introduces spurious clusters into the imputed results and fails 
to replicate the true distribution of cell populations. Fig. |5] shows an example of false clusters from the 
NN imputation algorithm (for detailed experimental setup, see Section [V]l. We present a toy example to 
explain why the NN imputation can fail, and to motivate our approach. 
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C. Motivating Toy Example 

Fig. [5] shows a toy example dataset in M'^. In the two data files, two of three features of these points 
are observed: c and si in tile 1, and c and S2 in file 2. Each data point belongs to one of two clusters, 
but its label is unavailable. 

When imputing feature si of units in file 2, the NN algorithm produces four clusters whereas there 
should be two, as shown in Fig. [6] (d). This is because the NN method uses only one feature, and fails 
to leverage the information about the joint distribution of variables that are not observed together. On 
the other hand, if we can infer the cluster membership of data points, the NN imputation can be applied 
within the same cluster. Hence, we search a donor from the subgroup (1) for the data points in (3), and 
likewise we search a donor from (2) for the points in (4) in the example. Then the file matching result 
greatly improves and better replicates the true distribution as in Fig. [6] (e). 



(a) True (b) file 1 (c) file 2 




(d) NN (e) cluster - NN (f) (incorrect) cluster - NN 




Fig. 6. Toy example of file matching. Two files (b) and (c) provide partial information of data points (a) in R^. The variable 
c is observed in both files while si and S2 are specific to each file. The NN method creates false dot populations in the si vs. 
S2 scatter plot in (d). On the other hand, the NN applied within the same cluster successfully replicated the true distribution. If 
the cluster are incorrectly paired, however, the Cluster-NN approach fails, as in (f). 



In this example, as in real flow cytometry data, there is no way to infer cluster membership from the 

March 30, 2010 DRAFT 



9 



Input: two files Xi and X2 to be matched 

1. Cluster the units in Xi and A'2. 

2. Perform NN imputation within the same cluster. 
Output: statistically matched complete files Xi and X2 



Fig. 7. The description of the Cluster-based NN algorithm for two files. For two input flow cytometry data files Xi and X2, 
specific variables are imputed using NN method after clustering each cell into one of K clusters. 



data alone, and incorrect labeling can lead to poor results (Fig. |6] (f)). Fortunately, in flow cytometry we 
can incorporate prior knowledge to achieve an accurate clustering. 

III. Cluster-based Imputation of Missing Variables 
We first focus on the case of matching two files. The case of more than two files is discussed in Section 



VI For the present section, we assume that Xi and X2 have both been partitioned into K clusters. Let 
and X2 denote the cells in Xi and X2 from the kth cluster, respectively. 
Suppose that the data is configured as in Fig. [3] In order to impute the missing variables of a unit in 
file 1 , we locate a donor among the data points in file 2 that have the same cluster label as the recipient. 
When imputing incomplete units in file 2, the roles change. The similarity between two units is evaluated 
on the projected space of jointly observed variables, while constraining both units to belong to the same 
cluster. Then we impute the missing variables of the recipient by patching the corresponding variables 
from the donor. More specifically, for Xj G X^, we impute the missing variables by 



X- ' = X . " where x,- = argmm x,- — xJ 



2 



The proposed Cluster-based NN imputation algorithm is summarized in Fig. [7] 

In social applications such as survey completion, file matching is often performed on the same class 
such as gender, age, or county of residence. However, this information that is used to label each unit is 



available in data, and the inference as in our algorithm is not necessary 1 14|. 



IV. Clustering with Missing Data 

To implement the above approach, we view Xi and X2 as a single data set and cluster its elements. We 
propose a method for clustering with missing data based on a finite Gaussian mixture model. Mixture 
models are common models for flow cytometry where each component corresponds to a cell type. 
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While non-Gaussian models might provide a better fit, there is a trade-off between estimation error 
and approximation error. More complicated models tend to be more challenging to fit. Furthermore, even 
with an imperfect data model, we may still achieve an improved file matching. 

Thus, clustering amounts to fitting the parameters of the mixture model. In general, fitting such a model 
is ill-posed. For example, in the toy example, there is no way to know the correct cluster inference based 
solely on the data. However, we can leverage domain knowledge to select the number of components 
and initialize model parameters. 

A. Mixture of PPCA 

In a mixture model framework, the probability distribution of a d-dimensional data vector x takes the 
form 

K 

= ^7rfepfc(x) 

k=l 

where K is the number of components in the mixture and vr^, is a mixing weight. 

In flow cytometry, mixture models are common models of cell subpopulations. Mixture models with 
Gaussian components are common Q, Q, ||8|, although distributions with more parameters, such as 
t-distributions or skew i-distributions, have been proposed |[6|, ||7|. However, these models require 
estimating a large number of parameters, and it becomes difficult to obtain reliable estimates when 
the number of components or the dimensions of the data increase. In this application, the model needs 
not be perfect to get improved imputation. We adopt a probabilistic principal component analysis (PPCA) 
mixture model as a way to model cell populations with fewer parameters. Without PPCA, our experience 
has revealed that even a Gaussian mixture model may have too many parameters to be accurately fit. 

PPCA was proposed by [15] as a probabilistic interpretation of PCA. While conventional PCA lacks 
a probabilistic formulation, PPCA specifies a generative model. It is a latent variable model, in which a 
data vector is linearly related to a latent variable. The latent variable space is generally lower dimensional 
than the ambient variable space, so the latent variable provides an economical representation of the data. 

The PPCA model is built by specifying a conditional distribution of a data vector x in M'^, given a 
latent variable t in R^: 

p(x|t) = AA(Wt + /I, fj^I) 
where is a d-dimensional vector and W is a d x g linear transform matrix. The latent variable is also 



March 30, 2010 



DRAFT 



11 



assumed to be Gaussian with p(t) = J\f{0, 1). Then the marginal distribution of x is also Gaussian: 

p(x)=AA(/x,C) 

with a covariance matrix C = WW^ + cr^I. The posterior distribution can be shown to be Gaussian as 
well: 

where M = W^W + fi^I is a q x q matrix. 

The PPCA mixture model is a combination of multiple PPCAs. Each PPCA component explains local 
data structure or cell subpopulation. The model is defined by the collection of each component parameters 

= {tt/c, /^fci Cfc}- From a flow cytometry dataset X = {xi, • • • , Xjv}, an EM algorithm can learn 
the mixture model by iteratively computing these parameters. More details on the PPCA mixture and the 
EM algorithm are explained in |16J 

The mixture of PPCA offers a way of controlling the number of parameters to be estimated without 
completely sacrificing the flexibility of model. In mixture model framework, a more common choice is 
the standard Gaussian mixture model. In the Gaussian mixture model, however, each Gaussian component 
requires d{d + l)/2 covariance parameters to be estimated if a full covariance matrix is used. Thus, as 
the data dimension increases, more data points are needed for reliable estimation of those parameters. 
The number of parameters can be reduced by constraining the covariance matrix to be isotropic or 
diagonal. These are too restrictive, however, since an isotropic or diagonal covariance makes the Gaussian 
component spherical or, respectively, elliptical aligned along the data axes; hence, the correlation structure 
between variables cannot be captured. On the other hand, the PPCA mixture model lies between those 
two extremes, and allows to control the number of parameters by specifying q, the dimension of the 
latent variable. 

B. Mixture of PPCA with Missing Data 

Even though our file matching problem has a particular pattern of missing variables, we develop a 
more general algorithm that allows for an arbitrary pattern of missing variables. Our development assumes 
values are "missing at random," meaning that whether a variable is missing or not, is independent of 
its value [ 13J . Our algorithm may be viewed as an extension of the algorithm of [llj to PPCA, or the 
algorithm of ||T6l to data with missing values. 

Denoting the observed and missing components by o„ and m„, each data point can be divided x„ = 
(x°",x™"). In a missing data problem, a set of partial observations {x'j'^ • • • ,x^} is given. Similar 
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to EM algorithms for Gaussian mixture models, we introduce indicator variables z„. One and only one 
entry of z„ is nonzero, and Znk = 1 indicates that the kth component is responsible for generating x„. 
We also include the missing components x^" and the set of latent variables tnk for each component to 
form the complete data (x,°, xj^*, t„fc, z„) for n = 1, • • • , N and k = 1, - ■ ■ ,K. 

We derive an iterative EM algorithm for the PPCA mixture model with missing data. The key difference 
from the EM algorithm for completely observed data is that the conditional expectation is taken with 
respect to x" as opposed to x in the expectation steps. 

To develop an EM algorithm, we employ and extend the two step procedure as described in p6| . In 
the first stage of the algorithm, the component weights vr^ and the component center /x^ are updated: 

^k=-^^{Znk), (1) 
n 

^ «"■> 

where {znk) = P{znk = is the responsibility of mixture component k for generating the unit 

x„, and (x™") = E[x™"|2;nfc = is the conditional expectation. Note that we are not assuming 

the vectors in the bracket are stackable. This notation can be replaced by the true component ordering 
without difficulty. 

In the second stage, we update and cj^: 

Wfc =SkWk{all + M-iW^SfcWfc)-i, (3) 

dl =^tr (Sfe - SfeWfcM^ iW^) (4) 

from local covariance matrix S^: 

Sfc = ^{Znk){{^n - P'k){^n - P-kf)- 

These update rules boil down to the update rules for completely observed data when there are no missing 
variables. We derive the EM algorithm in detail in Appendix [A] 

After model parameters are estimated, the observations are divided into groups according to their 
posterior distribution: 

argmaxp(z„fc = l|x°"), 

k=l,-K 

so each unit (cell) is classified into one of K cell populations. Note that this posterior probability is 
computed in the E-step. 
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Fig. 8. Types of white blood cells. Each cell type is characterized by a set of expressed CD markers. The cluster of differentiation 
(CD) markers are commonly used to identify cell surface molecules on white blood cells. The '+/— ' signs indicate whether a 
certain cell type has corresponding antigens on the cell surface. 



C. Domain Knowledge and Initialization of EM algorithm 

In file matching of flow cytometry data, domain knowledge is critical. First, as explained above, the 
incompletely observed data is insufficient to determine the correct cluster labeling. Second, the initial 
conditions of the EM algorithm affect its performance and convergence rate. Domain knowledge allows 
us to choose the number of components, and to initialize the algorithm so that it converges to the best 
local maximum. 

In flow cytometry, from the design of fluorochrome marker combinations and the knowledge about 
the blood sample composition, we can anticipate certain properties of cell subpopulations. For example. 
Fig. [8] summarizes white blood cell types and their characteristic cluster of differentiation (CD) marker 
expressions. That these are six cell types suggests choosing K = 6 when analyzing white blood cells. 

The CD markers indicated are commonly used in flow cytometry to identify cell surface molecules on 



leukocytes |18|. However, this information is qualitative, and needs to be quantified. 

To achieve this, we use one dimensional histograms. In a histogram, two large peaks are generally 
expected depending on the expression level of the corresponding CD marker. If a cell subpopulation 
expresses a CD marker, denoted by '+', then it forms a peak on the right side of the histogram. On the 
other hand, if a cell population does not express the marker, denoted by '— ', then a peak can be found 
on the left side of the histogram. We use the locations of the peaks to quantify the expression levels. 

These quantified values can be combined with the CD marker expression levels of each cell type to 
specify the initial cluster centers. Thus, each component of /.i^ of a certain cell type is initialized by 
either the positive quantity or the negative quantity from the histogram. In our implementation, these are 
set manually based on visual inspection of histograms. Then we initialize the mixture model parameters 
{vTjt, fij^, Wfc, cr^} as described in Fig. M 
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Input: A'l, X2 data files ; K the number of components ; q the dimension of latent variable space ; fi/^ 
for initial component mean, 
for A; = 1 to do 

1. using distance ||x°" — /x^" ||, find the set of data points whose nearest component mean is fij^ 

2. initialize a covariance matrix with random entries 

3. replace submatrices of Cfc with sample covariance of data points in X^ 

4. make Ck positive definite by enforcing the eigenvalues to be positive 

0. set TTfc — 

6. set Wfc with the q principal eigenvectors of Ck 

7. set al with the average of remaining eigenvalues of 
end for 

Output: {vTfc, fi^, Wk, al} for k = 1, - ,K 

Fig. 9. Parameter initialization of an EM algorithim for missing data. Cell populations are partitioned into K groups based 
on the distance to each component center. The component weight nk is initialized according to the size of each partition. From 
the covariance matrix estimate bCk, parameters bWk and (t| are initialized by taking eigen-decomposition. 

C Si S2 

C 
Si 
S2 

Fig. 10. Structure of covariance matrix C. The sub-matrices C^^ "^ and C^^'"^ cannot be estimated from a sample covariance 
matrix because these variables are never jointly observed. 

An important issue in file matching arises from the covariance matrix. When data is completely 
observed, a common way of initialization of a covariance matrix is using a sample covariance matrix. In 
the case of file matching, however, it cannot be evaluated since some sets of variables are never jointly 
observed (see Fig. [TO] ). We chose to build a covariance matrix Ck from variable to variable with sample 
covariances. For example, we can set C^''^^ with the sample covariance for variables c and si based on 
cases for which both variables c and si are present. On the other hand, the submatrix C^,^'*^ cannot 
be built based on the observation. In our implementation, we set the submatrix C^^'^^ with arbitrary 



March 30, 2010 



DRAFT 



15 



ID 


Ni 


N2 




Patient 1 


10000 


10000 


5223 


Patient2 


7000 


7000 


4408 


Patient3 


3000 


3000 


3190 



Fig. 11. Three flow cytometry datasets from three different patients. Each dataset is divided into two data files and an evaluation 
set. A^i and denote the size of two data files and A'^e is the size of evaluation set. 

FS SS CD56 CD 16 CD3 CDS CD4 

file 1 
file 2 



Fig. 12. File structure used in the experiment. FS, SS, and CD56 are common in both files, and a pair of CD markers are 
observed in only one of the files. The blank blocks correspond to the unobserved variables. The blocks in file I are matrices 
with TVi rows, and the blocks in file 2 are matrices with 7V2 rows. 



values. However, the resulting matrix may not be positive definite. Thus, is made positive definite 
by replacing negative eigenvalues with a small positive value. Once a covariance matrix Cfc is obtained, 
we can initialize and by taking eigen-decomposition of Ck- 

V. Experiments and Results 

We apply the proposed file matching technique to real flow cytometry datasets, and present experimental 
results. 

Three flow cytometry datasets are prepared from lymph node samples of three patients. These datasets 
were provided by the Department of Pathology at the University of Michigan. The measurements are of 
different sizes and have seven attributes: FS, SS, CD56, CD 16, CDS, CDS, and CD4. Each dataset is 
randomly permuted ten times and divided into two data files and a separate evaluation set. In Fig. [TTJ the 
cell counts of the two files and the held-out set are denoted Ni, N2, and A'^e, respectively. Two attributes 
from each file are made hidden to construct hypothetical files with missing data. Thus, CD 16 and CD3 
are available only in file 1, and CDS and CD4 are available only in file 2, while FS, SS, and CD56 are 



common. The pattern of the constructed data files is illustrated in Fig. 12 where the blocks of missing 
variables are left blank. 

For each white blood cell type, its expected marker expressions (CD markers), relative size (FS), 



and relative granularity (SS) are presented in Fig. 13 Because it is from a lymph node sample, the 



majority of cell population is lymphocytes, while the most common white blood cells in a human body 
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Fig. 13. Cell types in the dataset and their corresponding marker expressions. '+' or '-' indicates whether a certain cell type 
expresses the CD marker or not. 




Fig. 14. Histogram of each marker in the dataset. The peaks are hand-picked and are indicated in each panels. 



are granulocytes. The '+/— ' signs indicate whether a certain cell type expresses the markers or not. 
For example, helper T cells express both CD3 and CD4 but not others. This qualitative knowledge is 



quantified with the help of single dimensional histograms as explained in Section IV-C Two dominant 
peaks are picked from each histogram and their corresponding measurement values are set to the positive 



and negative expression levels. Fig. 14 and Fig. [15] summarize this histogram analysis. 

Two incomplete data files are completed following the procedure as described in Fig. |7] A mixture 
of PPCA is fitted with six components because six cell types are expected from this dataset. The latent 
variable dimension of each PPCA component is fixed to two. 
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Fig. 15. The positive and negative expression levels are summarized. 

The synthesized data after file matching is displayed in Fig. |5] The figure shows scatter plots of specific 
variables: CD16, CD3, CD4, and CDS. Note that these marker pairs are not jointly observed from the 
two incomplete data files. The imputation results from the NN and the Cluster-based NN methods are 
compared in the figure. For reference, scatter plots from the original complete dataset (ground truth) 
are also presented. As can be seen, the results from the Cluster-based NN are far more similar to the 
true distributions. On the other hand, the results from the NN method generates spurious clusters in the 
CD3-CD8 and CD3-CD4 scatter plots. In Fig. [5] these false clusters are indicated. 

A. Evaluation method 

To quantitatively evaluate the previous results, we use KuUback-Leibler (KL) divergence. The KL 
divergence between two distribution /(x) and g{-x) is defined by 

KL{g\\f)=¥.g [log<7-log/]. 

Let / denote a true distribution responsible for the observations, and g denote its estimate. 

The KL divergence is not symmetric, so KL{f \\ g) and KL{g \\ f) have different meanings. For a 
given distribution /, a distribution g minimizes KL{f \ \ g) when g takes nonzero values in the region 
where / takes nonzero values; hence, it overestimates the support of /. On the other hand, KL{g || /) is 
minimized for g that is close to zero in the region where / is near zero. A distribution g that minimizes 
KL{g II /) tends to have smaller support. Therefore, KL{g \\ f) is a better evaluation method for detecting 
spurious clusters in an estimate. 

Then the empirical estimate of the KL divergence is evaluated by 

KLig II /) « KL{g\\ / ) « ^ J] [log?(X„) - log / (X.) . 

^ n=l 

where the distributions / and g are replaced by their corresponding density estimates, and the expectation 
is approximated by a finite sum over imputed results on the held-out validation set of size N,.. 

We used kernel density estimation on the ground truth data and the imputed data for / and g, 
respectively. The KL divergences are computed for ten random permutations, and their averages and 
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ID 


NN (file 1) 


Cluster-NN (file 1) 


NN (file 2) 


Cluster-NN (file 2) 


Patient 1 


2.90 ± 0.05 


1.55 ± 0.05 


2.66 ± 0.03 


1.12 ± 0.04 


Patient2 


4.54 ± 0.07 


1.22 ± 0.03 


4.12 ± 0.08 


0.92 ± 0.03 


Patients 


4.46 ± 0.10 


2.40 ±0.11 


4.18 ± 0.11 


2.30 ± 0.07 



Fig. 16. The KL divergences are computed for ten permutations of each flow cytometry dataset. The averages and standard 
errors are reported in the table. For both the NN and Cluster-based NN algorithm, the file matching results are evaluated. The KL 
divergences of Cluster-based NN are closer to zero than those of NN. Thus the results from Cluster-based NN better replicated 
the true distribution. 



Standard errors are reported in Fig. 16 As can be seen, the KL divergences from Cluster-based NN are 
substantially smaller than those from NN. Therefore, the Cluster-based NN yields a better replication of 
true distribution. 



VI. Discussion 

In this paper, we demonstrated the use of a cluster-based nearest neighbor imputation method for tile 
matching in flow cytometry data. We applied the proposed algorithm on real flow cytometry data to 
generate a dataset of higher dimensions by merging two data files of lower dimensions. The resulting 
matched file can be used for visualization and high-dimensional analysis of cellular attributes. 

While the presented imputation method focused on the case of two files, it can be generalized to more 
than two files. For each missing component of a recipient cell, we can find a donor cell among files that 
have the component of interest. We envision two extensions of the clustering-based imputation method. 
The first is training a PPCA mixture model on all the data files. This approach involves the entire data 
points for model fitting. The second method considers a pair of files at a time. In this approach, we 
first select a donor file in which the missing component of the recipient file is available. Then we apply 
method of this paper to the pair of files. This approach involves smaller number of data points in training, 
but mixture models of smaller dimensions need to be fitted multiple times. After training of a mixture 
model and clustering of each cell, the similarity between cells can be computed. The Euclidean distance 
on the projected space of commonly observed variables can be used to find the similarity under the 
constraint that both units should have the same cluster label. The missing components are then imputed 
from the donor. 

Future research directions include finding ways of automatic prior information extraction. The con- 
struction of covariance matrices from incomplete dataset in the initialization of the EM algorithm is also 
an interesting problem. We expect that better covariance structure estimation will be helpful for better 
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replication of non-symmetric and non-elliptic cell populations in the imputed results. 

A limitation of this work is that it has only been validated on lymphocyte data, where, for certain 
marker combinations, cell types tend to form relatively well-defined clusters. However, for other samples 
and marker combinations, clusters may be more elongated or less well-defined due to cells being at 
different stages of physiologic development. Future work may also consider more flexible models for 
clustering such data, and associated inference algorithms. 

Appendix 

Suppose that we are given an incomplete observation set. We can divide each unit x„ as x„ = 

by separating the observed components and the missing components. Note that we do not assume that 
the observed variables are first, and the notation can be replaced by the actual ordering of components 
without difficulty. 

In the PPCA mixture model, the probability distribution of x is 

K 
k=l 

where K is the number of components in the mixture and Tr^ is a mixing weight corresponding to the 
component density p(x|fc). We estimate the set of unknown parameters 6 = {vrfc, /x^, W^, (t|} using an 
EM algorithm from the partial observations {x°^, • • • , x^}. 

To develop an EM algorithm, we introduce indicator variables z„ = {zni, • • • , Zjik) for n = 1, • • • , N. 
One and only one entry of z„ is nonzero, and Znk = 1 indicates that the kth component is responsible 
for generating x„. We also include a set of the latent variables t„fc for each component, and missing 
variables x^" to form the complete data (x°" , x™" , tnk, z„) for n = 1, • • • , and k = 1, - ■ ■ ,K. Then 
the corresponding complete data likelihood function has the form: 



-Cc = ^ ^ Znk In [iTkPi^n, t 



nk) 



n k 
^ ^nfc 



InvTfc - 1^ Inal - -^tr [(x„ - /Xfc)(x„ - /i^)^] 



2 ^ 2al 



+ [i^n - t^kKk^l] - ^tr [WlWktnktlki 



where terms independent of the parameters are not included in the second equality. Instead of developing 
an EM algorithm directly on this likelihood function Cc^ we extend the strategy in | [T6| and build a 
two-stage EM algorithm, where each stage is a two-step process. 
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In the first stage of the two stage EM algorithm, we update the component weight -k^ and the component 
mean Hj^. We form a complete data log-hkehhood function with the component indicator variables z„ 
and missing variables while ignoring the latent variables tnk- Then we have the following hkelihood 
function: 

N K 
n=l k=l 

n k 

where terms unrelated to the model parameters are omitted in the second line. We take the conditional 
expectation with respect to p(zji,x™"|x°"). Since the conditional probability factorizes as 

p{Zn, X^f " |X°") = p(z„|x^'')p(x;^'' \Zn, X^"), 



luTTfe - ^ln|Cfe| - ^tr[C^^(x„-/ife)(xn-/[ijfc)^] 



the next conditional expectations follow 

{Znk) =P(^K") 



7rfep(x°"|/c) 



Efc'7rfc'P(xn"|A;')' 



- C 



where (•) denote the conditional expectation. Maximizing (jCi) with respect to tt^, using a Lagrange 
multipher, and with respect to iJ,k give the parameter updates 

j^^i^nk), (5) 



X" 



(6) 



In the second stage, we include the latent variable ink as well to formulate the complete data log- 
likelihood function. The new values of tt^ and /i^, are used in this step to compute sufficient statistics 
Taking the conditional expectation on Cc with respect to p(z^, t„fe, x^" |x°"), we have 



n k 
1 



^tr [((x„ - /ijk)(x„ - fii^f)] 



+ -2tr[((xn-MfeKifc)Wr 



tT[WlWk{tnktlk)] 
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( 




- Z^fc + 














Qnk 



Since the the conditional probability factorizes 

Pi'^m t^fc, X„ " |x„") = p(z„|x„" )p(x„ " |z„, X,„" )p(t„fc|z„, X„" , X^ 

we can evaluate the conditional expectations as follows : 

Qnk 

((Xn - fik)^lk) =((x„ - Afc)(x„ - /2fc)'^>WfeM~\ 

(t„,0 =alM,' + M-iwn(xn - /ifc)(xn - /lfc)^>WfcM-i. 

Remember that the q x q matrix = W^W^ + a^I. Then the maximization of (Cc) with respect to 
Wfc and £7^ leads to the parameter updates, 



tin„m„ 
'k 



QiO„m„ 



J]](z„fc)((x„ -/Ifc)t 



n/c/ 



nk^ 



J]](2;nfc>tr [((x„ - /Ifc)(x„ - /2fc)'^>] 



dY^ni^nk) 

- 2 ^{znk)tr [((x„ - fikKk)^l 



(7) 
(8) 



+ J](z„fc)tr [W^Wfc(t„fct 



nfc/ 



Substituting the conditional expectations simplifies the M-step equations 



where 



Xl^^'^'^^^^^" ~ /^fc)(Xn - /ifc)'^ 



(9) 
(10) 



NTTk 

Each iteration of the EM algorithm updates the set of old parameters {vrfe, /x^,, W^, o"^} with the set of 
new parameters {77^, /x^, W^, as in ([5]), ([6]), (|9]l, and ( fTO] ). The algorithm terminates when the value 
of the log-likelihood function no longer changes. 
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